% This is LLNCS.DEM the demonstration file of
% the LaTeX macro package from Springer-Verlag
% for Lecture Notes in Computer Science,
% version 2.4 for LaTeX2e as of 16. April 2010
%
\documentclass{llncs}
%
%\usepackage{makeidx}  % allows for indexgeneration
\usepackage[utf8]{inputenc} % support for Portuguese accentuation
\usepackage{nicefrac}
\usepackage{graphicx} % support for imaging
\usepackage{ftnxtra}
\usepackage{fnpos} % \makeFNbelow by default
\usepackage{float}
\usepackage[hyphens]{url}
\usepackage{fixltx2e}
\usepackage{enumerate}
%
\begin{document}
%
\frontmatter          % for the preliminaries
%
\pagestyle{headings}  % switches on printing of running heads
\pagenumbering{arabic}
\title{Semantics for Big Data in the Energy Domain}
%
\author{Pedro Martins Correia}
%
\authorrunning{Pedro Martins Correia} % abbreviated author list (for running head)
%
%%%% list of authors for the TOC (use if author list has to be modified)
\tocauthor{Pedro Martins Correia}
%
\institute{Departamento de Engenharia Informática,\\ Instituto Superior Técnico, UTL, Lisbon, Portugal,\\
	\email{pedro.p.correia@ist.utl.pt}}

\maketitle              % typeset the title of the contribution

\begin{abstract}
Energy B2C (Business to Consumer) customers find it difficult to notice when their consumption pattern changes, because they only realize how much energy they have consumed once per month, at billing time. This work aims at studying ways of processing customers' \nicefrac{1}{4}-hour interval meter readings, drawing conclusions on consumption pattern evolution, and comparing each customer with other customers that have similar consumption patterns. This work will result in an application that gives advice to customers, depending on the conclusions drawn from their consumption pattern changes. This advice will be inferred from the data collected through an ontology that will be built to represent the entities involved in this domain (e.g., customer, consumption).
\keywords{Semantics for Big Data, Ontology Engineering, Energy advisor}
\end{abstract}
%
\section{Introduction and Motivation}
The expression Big Data was first used by Cox and Ellsworth in 1997 while solving the problem of scientific visualization of Computational Fluid Dynamics (CFD). The input data sets used in such tasks could surpass 100 Gbytes, scaling with the ability of a supercomputer to generate them. Cox and Ellsworth proposed a solution resorting to out-of-core techniques to overcome the problem of processing information too large to fit into a computer's main memory \cite{cox:ells}. In short, Big Data is a term used to describe data that is too large and complex to be stored in a regular database (or in memory) and to be processed by traditional application techniques.

Nowadays, one of the most prominent examples of Big Data processing is the CERN (European Council for Nuclear Research) ATLAS (A Toroidal LHC ApparatuS) Experiment\footnote{\url{http://home.web.cern.ch/about/experiments/atlas}}. Its goal is to detect particle collisions (mostly protons) in the LHC (Large Hadron Collider) at CERN, which aims to extend the frontiers of particle physics, for example by proving the existence of the Higgs boson. Inside the LHC, up to 10\textsuperscript{11} protons are accelerated to collide 40 million times per second. Data is gathered from particle collisions using about 150 million sensors, capable of measuring 100 thousand events every second and generating roughly 1 Petabyte (10\textsuperscript{15} bytes, the equivalent of more than 200,000 single-sided DVDs) of raw data per second. As of April 2013, the experiment was producing 30 Petabytes of data annually, amounting to 140 Petabytes in total, and CERN expects data to flow into the ATLAS Grid at a rate of about 40 Petabytes per year in 2015. Data is processed on the Titan System (a Cray XK7 supercomputer), which is composed of 18,688 compute nodes (27.1 Petaflops) and 710 Terabytes of total memory. A trigger system is used to filter the raw data, because such an amount of information is too much to store and process \cite{panda,atlas}. This system consists of three levels of event selection, which together reduce the event frequency to approximately 200 events per second, with an average event processing time on the order of four seconds \cite{atlas}. While in ATLAS data can be processed well after the experiment, that is not the case for a non-stop news stream such as Thomson Reuters, which in 2013 delivered over 2 million unique news stories, nearly 900,000 news alerts, over 500,000 pictures and roughly 100,000 video stories \cite{reuters}.

Processing Big Data requires domain-specific techniques. Merging raw data into smaller chunks of information, or producing concepts able to abstract away details in the raw data, are the most commonly used techniques, the main goals being the reduction of data size and complexity.
\begin{figure}[h]
	\centering
	\includegraphics[width=0.8\textwidth]{images/GR2011021100614.jpg}
	\caption{World's capacity to store information}
	\label{fig:world_information_storage}
\end{figure}
Figure \ref{fig:world_information_storage}, from the Washington Post\footnote{\url{http://www.washingtonpost.com/wp-dyn/content/article/2011/02/10/AR2011021004916.html}}, illustrates the results of research conducted at the University of Southern California \cite{worldstore}. It depicts the data production rate by estimating worldwide storage capacity at 7-year intervals from 1986 to 2007, highlighting that the capacity grows faster with each estimate. From the year 2000 onwards, data growth is clearly much greater than before, which coincides with the emergence of the term Big Data in 1997, shortly before 2000. According to Kryder's Law, hard drive density tends to double every 13 months, while, according to Moore's Law, computational power tends to double every 2 years; that is, computation is expected to grow at almost half the rate of storage ($2^{12/13} \approx 1.90\times$ per year versus $2^{1/2} \approx 1.41\times$ per year). The primary characteristics of Big Data are mostly known as its 5 V's, the first 3 V's having been originally introduced by Doug Laney of Gartner \cite{physicalcybersocial}:
\begin{description}
\item[Volume] This is the main feature: nowadays, data sets can amount to Petabytes of information. The volume of data to process creates two challenges: starting from fine-grained information and summarizing it into a form humans can comprehend, and scaling computations to the available processing infrastructure, allowing reasoning even on devices with limited resources such as mobile devices (e.g., a cellular phone or smart-watch) or an airplane.
\item[Variety] Data can be structured, semi-structured or unstructured. Most data sets are unstructured, which typically means plain text \cite{realtimerdf} without any kind of identification, in opposition to structured data, where data follows a predefined data model, such as in a relational database. In semi-structured data, information is organized in a predefined manner, but unlike structured data, instances of the same class may have different attributes. XML\footnote{\url{http://www.w3.org/XML/}} is well-suited for semi-structured data \cite{semiToXML}, because the standard puts no restrictions on the tags or their nesting, giving users the freedom to change their data without needing to update a schema. Web content is defined mostly in HTML\footnote{\url{http://www.w3.org/html/}}, which resembles XML syntactically; however, because it mostly defines layout and user interface behavior, the actual web page content is plain text within the tags or, in extreme situations, intermixed with images, making it hard to discover its meaning.
\item[Velocity] The rate at which data is generated is too high for full real-time processing, requiring raw data to be filtered so that only the information that really matters is processed. As happens in the CERN ATLAS Experiment, the raw data from sensors is generated at a rate higher than the ability to store it, and at an even higher rate than the ability to process it, even with 27.1 Petaflops\footnote{\url{http://en.wikipedia.org/wiki/FLOPS}} and 710 Terabytes of system memory \cite{panda}. For that reason, the ATLAS sensors only gather information considered relevant to the experiment.
\item[Value] Fine-grained data is unintelligible to humans and needs to be processed and interpreted by a machine to make it intelligible, and thus valuable. This requires the ability to perform data acquisition, identify relevant knowledge within the raw data, and construct and apply models for concept analysis.
\item[Veracity] Data may be gathered from defective sensors, or be affected by some temporary situation one cannot predict or control, compromising the quality of the information and thus introducing complexity into the models. Statistical methods can be used to mitigate such events on homogeneous sensor networks (a minimal filtering sketch follows this list), while semantic models are necessary for heterogeneous ones. An example of a heterogeneous sensor network is an airplane, which has multiple types of sensors: some monitor mechanical and hydraulic parts, others provide geographic localization, while others indicate human actions (e.g., the pilot increasing engine throttle) that may contradict conclusions inferred from the information gathered by the other sensors.
\end{description}
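
To make the Veracity point concrete, the sketch below illustrates one common statistical mitigation on a homogeneous sensor network: discarding readings whose robust z-score (based on the median and the median absolute deviation) marks them as implausible. This is a minimal illustration, not part of any system described above; the threshold value is a conventional assumption.

\begin{verbatim}
import statistics

def filter_outliers(readings, threshold=3.5):
    """Drop readings whose robust z-score exceeds the threshold --
    a simple mitigation for defective sensors on a homogeneous
    network, where all sensors measure the same quantity."""
    med = statistics.median(readings)
    mad = statistics.median(abs(r - med) for r in readings)
    if mad == 0:  # all readings (almost) identical: nothing to drop
        return list(readings)
    return [r for r in readings
            if 0.6745 * abs(r - med) / mad <= threshold]

# Example: one faulty voltage sensor reporting an implausible value.
print(filter_outliers([230.1, 229.8, 231.0, 0.0, 230.4]))
# -> [230.1, 229.8, 231.0, 230.4]
\end{verbatim}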

Nowadays, detailed energy consumption information, such as customers' electricity load curves, is being ignored, considered garbage by some companies, and only aggregated information is regarded. By reducing the amount of information to the minimum necessary for billing, valuable details that lie within are lost. In the electricity domain, Big Data exists, for example, in the data gathered from all the sensors spread throughout the supply network and sent to the control centers. Processing all the data that sensors gather, in real time, would allow the control centers to make sure energy quality is met according to the country's regulations, by immediately isolating problematic zones caused, for instance, by outages \cite{outageDetection} or power quality degradation. For example, in Portugal, ERSE\footnote{\url{http://www.erse.pt/}} (Entidade Reguladora dos Serviços Energéticos) imposes that the energy supplied to customers complies with regulatory metrics based on parameters such as voltage level and Service Quality Zone\footnote{\url{http://www.erse.pt/eng/electricity/servicequality/Documents/SE-QS-QST-EN-VF.pdf}}, number and duration of supply interruptions, electric signal amplitude, frequency, wave form (e.g., noise, sag, or swell of the sine wave) and symmetry of the three-phase voltage. Some types of customers (e.g., semiconductor factories, due to the sensitivity of their equipment and process controls) demand, by contract, the supply of energy with higher levels of quality, possibly resulting in a network node reorganization that enables the company to assure the delivery of such quality levels. Unless there is a system able to quickly analyze sensor data and draw conclusions in time to take action, the supplier company may incur service quality non-compliance fees.

More recently, consumers' meters, such as those of EDP's (Energias de Portugal) InovCity\footnote{\url{http://www.inovcity.pt/}} project, send near real-time readings of about 32,000 customers directly to the energy supplier company. This project, taking place in Évora (Portugal), is a test bed for InovGrid\footnote{\url{http://www.edpdistribuicao.pt/pt/rede/InovGrid/Pages/InovGrid.aspx}}, a larger European Union project. InovGrid has already led to an increase in the efficiency of customers' energy consumption of about 20\%, as a result of increased awareness of power consumption. Replication of this project throughout Portugal is expected to account for 8\% of the Portuguese CO\textsubscript{2} reduction target by 2020 \cite{SmartGrid}. Meter readings are communicated every 15 minutes, resulting in about 35 thousand readings during a year for a single customer; taking all InovCity customers into account, this totals over 1 billion readings per year (a sketch of this arithmetic follows below). These meter readings are used to bill customers with real readings instead of consumption estimates, to display detailed consumption to customers, and to adjust power consumption. Customers' meters are used as part of the supply sensor network, contributing to detecting and preventing problems such as faulty equipment, and to adjusting power production and distribution (Smart Grid\footnote{\url{https://www.smartgrid.gov/}}) according to customers' energy consumption and production, thus reducing CO\textsubscript{2} emissions. Besides network management by the supplier company, the customer's commercialization company may also use such detailed consumption information to speculate about the consequences of changing energy prices, by studying the impact on customers' consumption patterns.
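
As a quick check on those figures, the following minimal sketch reproduces the arithmetic: one reading every 15 minutes yields about 35 thousand readings per customer per year, and the roughly 32,000 InovCity customers together exceed one billion readings per year.

\begin{verbatim}
READINGS_PER_HOUR = 4        # one reading every 15 minutes
HOURS_PER_YEAR = 24 * 365
CUSTOMERS = 32_000           # approximate InovCity customer count

per_customer = READINGS_PER_HOUR * HOURS_PER_YEAR
total = per_customer * CUSTOMERS

print(per_customer)  # 35040 readings per customer per year
print(total)         # 1121280000 -- over 1 billion readings per year
\end{verbatim}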
\textbf{!!!TODO: DISCUSS SEMANTICS!!!}

There are two major classes of clients: B2B (Business to Business) and B2C (Business to Consumer). B2B refers to companies such as factories, restaurants, street advertising, agriculture and banks, while B2C refers to regular family homes and small businesses, such as the coffee shops or stationery shops found in InovCity. The energy consumption patterns of B2B clients follow a completely different paradigm from those of B2C. While B2B clients are far fewer than B2C ones, they consume far higher amounts of energy. Although their expectations are the same (to pay as little as possible), the way they think is completely different and, accordingly, they are served with different marketing strategies. B2C clients tend to have consumption patterns, because humans are creatures of habit, and this makes those patterns amenable to analysis. On the other hand, B2B clients are much more heterogeneous and have their own specific needs. Some even have seasonal activity, such as in farming, where, for instance, they can be active in the summer and inactive during the rest of the year, or vice-versa.

The following sections detail today's problems regarding Big Data processing and how a semantic approach can improve performance and scalability when working with Big Data. Section 2 details the goals of this work. The analysis of related work is described in Section 3. The architecture for this work is described in Section 4. Then, Section 5 presents methods for evaluating the practical results after implementation. Section 6 presents the conclusions reached, and Section 7 proposes a plan for implementing the solution, allowing conclusions to be drawn from the results.
%
\section{Goals}
%
Nowadays, electricity consumers only realize how much energy they have consumed during a period at billing time. In Portugal, the regulatory entity (ERSE) demands that the customer be billed using at least one real reading every 3 months. If, in that period, users do not want to be billed using estimated readings, they must communicate their readings to the commercialization company. InovGrid customers have a meter that communicates readings in near real time, and they have access to their own consumption graph in a private area of a public website, allowing them to overlay two months' readings, which helps to reveal bad consumption habits. Inspired by InovCity customers, with such a meter reading frequency, this work aims at addressing the following issues:
\begin{enumerate}
	\item Use an electricity consumption dataset to formulate an ontology that describes the information within the dataset (a minimal sketch of such an ontology follows this list):
	\begin{enumerate}[(a)]
		\item The consumer entity.
		\item The consumption on specific time slices and how it is related to the consumer.
		\item Pre-defined time slices on which metrics are built as a basis for determining consumption patterns (e.g., previous day, previous month, and previous year).
		\item The different consumption patterns and how they are related to the consumption.
		\item The advice that will be inferred from a given customer situation representing a pattern change.
	\end{enumerate}
	\item Develop automatic mechanisms to identify concepts based on consumption data:
	\begin{enumerate}[(a)]
		\item Interpret automatically identified concepts, translating key concepts into intelligible ones.
		\item Identify and verify which time slices contribute to distinguishable consumption patterns.
	\end{enumerate}
	\item Develop an application that warns and advises consumers about consumption pattern changes that reflect a degradation or improvement of their energy consumption habits:
	\begin{enumerate}[(a)]
	\item Present a monthly consumption graph to the user, along with other relevant indicators influenced by their consumption and used for pattern classification. This information will allow users to draw their own conclusions, similarly to InovCity.
	\item Present advice to the user by inferring conclusions based on recent and historic consumption. These conclusions are to take into consideration all other consumers that have the same consumption pattern.
	\end{enumerate}
\end{enumerate}
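As an illustration of goal 1, the sketch below encodes a fragment of the intended ontology as RDF Schema triples using the Python rdflib library. The class and property names, and the namespace URI, are assumptions made for this illustration only; the real ontology will be designed after studying the dataset.

\begin{verbatim}
from rdflib import Graph, Namespace, RDF, RDFS

# Hypothetical namespace for this illustration only.
EN = Namespace("http://example.org/energy#")
g = Graph()
g.bind("en", EN)

# Goal 1 (a)-(e): the core classes of the domain.
for cls in ["Consumer", "Consumption", "TimeSlice",
            "ConsumptionPattern", "Advice"]:
    g.add((EN[cls], RDF.type, RDFS.Class))

# How the classes relate to one another (assumed property names).
relations = [
    ("hasConsumption",  "Consumer", "Consumption"),         # (b)
    ("measuredOver",    "Consumption", "TimeSlice"),        # (c)
    ("exhibitsPattern", "Consumer", "ConsumptionPattern"),  # (d)
    ("triggersAdvice",  "ConsumptionPattern", "Advice"),    # (e)
]
for prop, domain, rng in relations:
    g.add((EN[prop], RDF.type, RDF.Property))
    g.add((EN[prop], RDFS.domain, EN[domain]))
    g.add((EN[prop], RDFS.range, EN[rng]))

print(g.serialize(format="turtle"))
\end{verbatim}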
This work focuses on B2C clients rather than B2B, because B2C falls more readily into the Big Data domain as a consequence of producing much larger quantities of readings. Moreover, it is likely that B2C clients have better outlined consumption patterns, which will hopefully result in a much more interesting set of conclusions. On top of that, it is easier for us to see ourselves in such patterns.

Formulating an ontology requires a deep understanding of the domain being explored, so that one may design an ontology able to answer the needs of the application that draws conclusions about clients' consumption pattern changes. Since one generally lacks deep knowledge about B2C clients' consumption patterns, those patterns must first be found and studied by analyzing the data within the dataset itself; only then can the consumption pattern profiles be designed and instantiated in the form of an ontology. In this work, consumption patterns will be identified and studied in the data staging phase, so that they can be defined in an ontology. The definition of the consumption patterns must take into consideration that a customer's consumption evolves with time, and consequently so do the consumption patterns to be defined. A minimal sketch of how such patterns might be discovered is given below.
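
As one plausible way of finding candidate patterns in the staging phase, the sketch below clusters customers by the shape of their average daily load curves using k-means. The input shape, the number of clusters, and the use of scikit-learn are assumptions for illustration; the actual technique will be chosen after studying the dataset.

\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical input: one row per customer, 96 columns holding the
# average consumption (kWh) for each 15-minute slot of the day.
rng = np.random.default_rng(0)
load_curves = rng.random((1000, 96))  # stand-in for real readings

# Normalize each curve so clusters reflect the *shape* of the
# pattern rather than the absolute amount consumed.
shapes = load_curves / load_curves.sum(axis=1, keepdims=True)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(shapes)

# Each cluster centroid is a candidate consumption pattern to be
# studied and, if meaningful, defined in the ontology.
for i, centroid in enumerate(kmeans.cluster_centers_):
    peak_slot = centroid.argmax()  # 15-minute slot of peak usage
    print(f"pattern {i}: peak at slot {peak_slot} "
          f"({(labels == i).sum()} customers)")
\end{verbatim}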


%
% ---- Bibliography ----
%
\begin{thebibliography}{10}
%
\bibitem {cox:ells}
M. Cox and D. Ellsworth. \textit{Application-controlled demand paging for out-of-core visualization.} Proceedings of the 8th Conference on Visualization '97, p. 235-ff., October 18-24, 1997, Phoenix, Arizona, USA.

\bibitem {atlas}
ATLAS Collaboration. \textit{The ATLAS Experiment at the CERN Large Hadron Collider.} JINST 3 (2008): S08003.

\bibitem {panda}
A. Klimentov, M. Borodin, K. De, S. Jha, D. Golubkov, T. Maeno, P. Nilsson, D. Oleynik, S. Panitkin, A. Petrosyan, J. Schovancova, A. Vaniachine and T. Wenaus. \textit{PanDA Beyond ATLAS: A Scalable Workload Management System For Data Intensive Science.} No. ATL-SOFT-SLIDE-2014-117, ATL-COM-SOFT-2014-010, 2014.

\bibitem {reuters}
Thomson Reuters. \textit{Thomson Reuters Annual Report 2013.} 12 March 2014.

\bibitem {worldstore}
M. Hilbert, and P. López. \textit{The world’s technological capacity to store, communicate, and compute information.} Science 332.6025 (2011): 60-65.

\bibitem {physicalcybersocial}
K. Thirunarayan, and A. Sheth. \textit{Semantics-Empowered Approaches to Big Data Processing for Physical-Cyber-Social Applications.} Proc. AAAI 2013 Fall Symp. Semantics for Big Data. 2013.

\bibitem {realtimerdf}
D. Gerber, S. Hellmann, L. Bühmann, T. Soru, R. Usbeck, and A.-C. Ngonga Ngomo. \textit{Real-time RDF extraction from unstructured data streams.} The Semantic Web–ISWC 2013. Springer Berlin Heidelberg, 2013. 135-150.

\bibitem {semiToXML}
R. Goldman, J. McHugh, and J. Widom. \textit{From semistructured data to XML: Migrating the Lore data model and query language.} (1999).

\bibitem {outageDetection}
Y. Zhao, R. Sevlian, R. Rajagopal, A. Goldsmith, and H. V. Poor. \textit{Outage detection in power distribution networks with optimally-deployed power flow sensors.} Power and Energy Society General Meeting (PES), 2013 IEEE. IEEE, 2013.

\bibitem {SmartGrid}
V. Giordano, F. Gangale, G. Fulli, and M. Sánchez Jiménez. \textit{Smart Grid projects in Europe: lessons learned and current developments.} Publications Office, 2011.

%
\end{thebibliography}

\end{document}
